A Statistical and Rule-based Method for Chunking Verbal Units in Thai Texts

Authors

Pimnapa Atsawintarangkun

Thanaruk Theeramunkong

Choochart Haruechaiyasak

Abstract

Tokenizing a text into a sequence of words is an important step toward text interpretation. It is required in many applications, such as text summarization, semantic search, and machine translation. Recent work has moved beyond splitting text into words to chunking it into units larger than words. Text chunking divides running text into non-overlapping groups of words that carry meaningful content, such as named entities and verbal units. In this work, we explore three layers of verbal units: (1) verbal sequences, (2) verb phrases (i.e., verbal chunks, causative forms, and event occurrences), and (3) elementary discourse units (EDUs). As the basic layer, a verbal sequence is defined as a single verb or a sequence of contiguous verbs without any interrupting nouns or particles. For example, ...
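The definition of a verbal sequence given above (a single verb or a run of contiguous verbs, broken by any noun or particle) can be illustrated with a minimal sketch. This is not the authors' statistical and rule-based method; it only groups adjacent verb tokens, assuming the input is already word-segmented and POS-tagged. The tag name `VERB`, the function name `verbal_sequences`, and the English-glossed toy tokens are all hypothetical.

```python
from typing import List, Tuple

# Assumption: verbs carry the tag "VERB"; a real Thai tagset may use other labels.
VERB_TAGS = {"VERB"}


def verbal_sequences(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Group contiguous verb tokens; any non-verb token ends the current group."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for word, tag in tagged:
        if tag in VERB_TAGS:
            current.append(word)
        elif current:
            # A noun, particle, or other token interrupts the sequence.
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


# Toy input with English glosses standing in for segmented Thai words.
example = [
    ("he", "PRON"),
    ("go", "VERB"),
    ("buy", "VERB"),
    ("rice", "NOUN"),
    ("eat", "VERB"),
]
print(verbal_sequences(example))
```

On the toy input, "go buy" forms one verbal sequence and "eat" forms another, because the noun "rice" interrupts the run of verbs.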

Similar Resources

Chunking Using Conditional Random Fields in Korean Texts

We present a method of chunking in Korean texts using conditional random fields (CRFs), a recently introduced probabilistic model for labeling and segmenting sequence data. In agglutinative languages such as Korean and Japanese, rule-based chunking methods are predominantly used for their simplicity and efficiency. A hybrid of a rule-based and a machine learning method was also proposed to handl...

Identifying the Boundaries and Types of Syntactic Phrases in Persian Texts

Text tokenization is the process of splitting text into meaningful tokens such as words, phrases, and sentences. Tokenization into syntactic phrases, known as chunking, is an important preprocessing step in many applications such as machine translation, information retrieval, and text-to-speech. In this paper, chunking of Farsi texts is done using statistical and learning methods and the grammat...

Japanese Unknown Word Identification by Character-based Chunking

We introduce a character-based chunking method for unknown word identification in Japanese text. A major advantage of our method is its ability to detect low-frequency unknown words of unrestricted character type patterns. The method is built upon SVM-based chunking, using character n-grams and the surrounding context of n-best word segmentation candidates from statistical morphological analysis as feat...

Japanese Named Entity Extraction with Redundant Morphological Analysis

Named Entity (NE) extraction is an important subtask of document processing such as information extraction and question answering. A typical method used for NE extraction of Japanese texts is a cascade of morphological analysis, POS tagging and chunking. However, there are some cases where segmentation granularity contradicts the results of morphological analysis and the building units of NEs, ...

A Punjabi Grammar Checker

This article describes grammar checking software developed to detect grammatical errors in Punjabi texts and to suggest corrections where appropriate. The system uses a full-form lexicon for morphological analysis and rule-based systems for part-of-speech tagging and phrase chunking. The system supported by a set of carefully devised erro...

Publication date: 2013
